Software Engineering for DS
Source
Software Engineering for Data Scientists, Catherine Nelson, April 2024, O'Reilly Media, Inc.
How to analyze code performance
Find bottlenecks
- timing the code
- profiling the code
- use profiler to get the time breakdown of a script
Handle runtime time complexity
- time complexity: how the running time of an algorithm grows as the size of the input increases
- Big O Notation: the runtime is described as a function of the size of the input,
: means constant time
- think about
- the number of steps the code will take
- how this will scale as the amount of data grows
Using Data Structures Effectively
| Operation | List | Tuple | Dict | Set | NumPy Array | pandas DataFrame |
|---|---|---|---|---|---|---|
| Index Lookup | Yes (O(1)) | Yes (O(1)) | No | No | Yes (O(1)) | Yes (fast) |
| Key Lookup | No (O(n)) | No (O(n)) | Yes (O(1)) | Yes (O(1)) | No (O(n)) | Yes (fast) |
| Search by Value | No (O(n)) | No (O(n)) | No (O(n)) | Yes (O(1)) | Yes (O(n)) | Yes (O(n)) |
| Insert | Yes/O(n) | No | Yes (O(1)) | Yes (O(1)) | No (O(n)) | No (O(n)) |
| Delete | No (O(n)) | No | Yes (O(1)) | Yes (O(1)) | No (O(n)) | No (O(n)) |
| Memory Efficiency | Medium | High | Medium-High | Medium-High | High | Low |
| Best Use | General | Fixed data | Fast lookup | Membership | Numeric ops | Tabular data |
| Key Feature | static; immutable | no inherent order | slow if iterating through rows |
Code Formatting, Linting, and Type Checking
Code formatting
- Python Enhancement Proposal 8 (PEP8)
- Import Formatting: PEP8 sets standards for how to group your imports - imports should be grouped in the following order
- Standard library imports
- Related third-party imports
- Local application/library specific imports
Linting
- Linting = checking your code for errors before it runs
- how to lint
- use some tool packages
- IDE automatically does this
Type Checking
= check the type of the input that a function is expecting to avoid potential errors
- use type annotations
- use some tool packages
Testing Code
Basic test structure (from Pytest documentation)
- arrange: set up everything needed to run the function
- act: run the function
- assert: check if the result is as expected
- cleanup: cleanup the testing trace
Types of Tests
- Unit Tests: check that units of code, such as functions or classes, do what a developer expects
- Integration tests: check that a larger system is connected correctly
- Acceptance tests: check that the system does what the user expects
- Load tests: check that the system still functions correctly if the amount of data or the number of users increases
- Security tests: check that the system is resistant to attacks
- Usability tests: check that the system is intuitive to use
Testing for DS & ML
- data validation
- test model training & inference
Design & Refactoring
Code Design
- modular code: critical to good design
- each function has a single responsibility -> easy to maintain
- robust code: capable of handling errors gracefully, preventing crashes and unexpected behaviors and results
- Framework
- function name
- inputs
- behavior
- outputs
ML project structure
- example:
├── README.md
├── requirements.txt
│
├── notebooks
│ ├── explore_data.ipynb
│ └── try_regression_model.ipynb
│
├── src
│ ├── __init__.py
│ ├── load_data.py
│ ├── feature_engineering.py
│ ├── model_training.py
│ ├── model_analysis.py
│ └── utils.py
|
├── tests
│ ├── test_load_data.py
│ ├── test_feature_engineering.py
│ ├── test_model_training.py
│ ├── test_model_analysis.py
│ └── test_utils.py
Refactoring
Refactoring is the process of changing a software system in a way that does not alter the external behavior of the code, yet improves its internal structure. —Martin Fowler
Documentation
- Names
- Variables and functions should use snake_case
- Class definitions should use CamelCase
- Constants or global variables should use ALL_CAPS
- Comments
- Docstrings
- READMEs
- Experiment tracking
Deployment
Containers & Dockers
- container = an isolated place to run your API or any other application
- docker = a system for building and managing containers
Cloud deployment
- common main steps
- create a Docker container on your local machine that contains the API code and details of the libraries
- upload this container to a container registry on a cloud profiler’s system (AWS, Microsoft Azure, or Google Cloud)
- deploy containers: instruct the cloud provider to run the chosen container
Security
What is security for software
- protecting a system from theft of information, damage, disruption, or unwanted access to information
Commonly used terms
- attacker: Wants to steal data from or disrupt a software system
- threat: An event that has the potential to harm users of a software system or the company that runs the system
- Vulnerability: A weakness in a software system that could be exploited by an attacker
- Risk: The likelihood and severity of a threat being carried out
Security risks & practices
- login credentials being stolen <- avoid committing secrets or data
- risk of third-party packages <- version control
- risk of pickle module (attackers could plant any code in a pickle file) <- use JSON instead
- use static analysis tools to scan code without running it
- for production code: security reviews and policies
- specific security issues for ML (on deployed models or training data) <- validating training data, monitoring the model, and being able to rapidly deploy a new model if an issue is found
Working in Software
The Software Development Lifecycle
- plan
- design
- build
- test
- deploy
- maintain
Software development
- Waterfall Software Development: each step in the project lifecycle is carried out following the end of the previous one
- Agile Software Development: focuses on frequent updates to software
- emphasize flexibility, rapid feedback from customers/users, and changing direction in response to events instead of making detailed plans for development
- teams are often cross-functional
- designers and analysts as well as developers
- a product owner who represents the point of view of the users or customers
- Agile methodologies: scrum, Kanban
Technical roles in software inductry
- Software Engineers: Write code and develop processes to build and maintain products.
- QA Engineers: Test the product to check that it works well for all users.
- Data Engineers: Build and maintain data pipelines to transform raw data into a format that data scientists and analysts can use.
- Data Analysts: Select, clean, analyze, and interpret data to derive insights from it.
- Product Managers: Plan and organize the requirements and roadmap for development work.
- UX Researchers: Research and analyze the needs of a product’s users.
- UI/UX Designers: Design the overall look (UI) and feel (UX) of a product.
My side notes
Jupyter Notebook version control solutions
- GitHub’s Notebook Diff Viewer
- GitHub now supports native diffs for .ipynb files in PRs. It shows:
- Added/removed cells
- Code-level diffs
- Rendered markdown diffs
- Output diffs (optional)
- GitHub now supports native diffs for .ipynb files in PRs. It shows:
- Use
jupytext- Automatically sync
.ipynbwith a.pyor.mdfile so Git diffs are readable. - Example: notebook
analysis.ipynb↔ paired fileanalysis.py - Git will show diffs on the
.pyfile, not the JSON.
- Automatically sync
- Use
nbstripout- Removes outputs before committing so diffs stay clean.
- Effect:
- Removes images, outputs, execution counts
- Leaves only pure code + markdown